Omics - Analysis of large-scale biomolecular datasets: Introduction to R
KIMN20 - LTH
2025-10-30
Introduction to R
Course date: 04 November 2025
Last modified: 2025-10-30
Welcome to R Programming! 🐧
This presentation will teach you the fundamentals of R programming for data analysis.
What we’ll cover:
R language basics and getting help
Working with packages
Variables and data types
Data structures (vectors, matrices, lists, data frames)
Reading data from files
Writing functions
What Is R?
R is a programming language and environment designed for statistical computing and graphics.
Key Characteristics:
Programming Language : High-level language for data analysis and visualization
Programming Platform : Complete environment with interpreter and development tools
Open-Source Project : Driven by the R core team and global community
Statistical Powerhouse : Specialized for statistical analysis and modeling
General-Purpose Tool : Can handle diverse computational tasks
Idea-to-Implementation Bridge : Transforms concepts into working solutions
What R is NOT:
A replacement for statistical expertise
The “best” programming language for every task
Always the most elegant or efficient solution
Yet R Excels At:
Statistical computing and data analysis
High-quality graphics and visualization
Rapid prototyping of analytical methods
Integration with other scientific tools
Getting Started with R
Opening R/RStudio
Command line : Type R in terminal
RStudio : Download from https://posit.co/download/rstudio-desktop/
Web-based : Use RStudio Cloud or Google Colab with R kernel
RStudio Server : https://130.235.8.214/rstudio
Your First R Commands
# This is a comment - it won't execute
print("Hello, R!")
Try running this command in your R console!
Getting Help
Built-in Help Functions
# Get help for a function
?print
help(print)
# Search for functions
??plot
help.search("plot")
Help for Packages
# Get help for installed packages
help(package = "base")
# Vignettes (detailed tutorials)
vignette()
# vignette("ggplot2") # Commented out to prevent browser opening
Working with Packages
Installing Packages
# Install from CRAN (main repository)
install.packages("tidyverse")
# Install multiple packages
install.packages(c("dplyr", "ggplot2", "readr"))
Loading Packages
# Load a package
library(tidyverse)
# Load with require() (returns TRUE/FALSE)
require(tidyverse)
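The difference between the two loaders shows up when a package is missing. A minimal sketch, using a deliberately made-up package name (`notARealPackage123` is an assumption for illustration and is expected not to be installed):

```r
# Hypothetical package name, assumed not to be installed
pkg <- "notARealPackage123"

# require() returns FALSE instead of stopping the script
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok

# library() raises an error; tryCatch() is used here only to keep the demo running
result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "not installed"
)
result
```

In everyday scripts, library() is usually the safer choice because a missing package fails immediately rather than later.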
Package Management
# See loaded packages
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.7.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Stockholm
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 jsonlite_1.8.9 compiler_4.4.1 tidyselect_1.2.1
[5] scales_1.3.0 yaml_2.3.10 fastmap_1.2.0 R6_2.5.1
[9] generics_0.1.3 knitr_1.50 munsell_0.5.1 pillar_1.9.0
[13] tzdb_0.4.0 rlang_1.1.4 utf8_1.2.4 stringi_1.8.4
[17] xfun_0.53 timechange_0.3.0 cli_3.6.3 withr_3.0.2
[21] magrittr_2.0.3 digest_0.6.37 grid_4.4.1 hms_1.1.3
[25] lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.1 glue_1.8.0
[29] fansi_1.0.6 colorspace_2.1-1 rmarkdown_2.30 tools_4.4.1
[33] pkgconfig_2.0.3 htmltools_0.5.8.1
# Update packages
update.packages()
# Remove packages
remove.packages("package_name")
Bioconductor
Bioconductor is the premier repository for R packages in bioinformatics and computational biology.
Key Features:
Specialized Tools : Packages for genomics, proteomics, sequencing analysis
Quality Assurance : Rigorous review process ensures high-quality, well-documented packages
Integrated Workflows : Complete analysis pipelines from raw data to publication
Active Community : Regular updates and new package releases
Installing Bioconductor Packages
# Install Bioconductor manager (one time)
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# Install packages
BiocManager::install("DESeq2") # RNA-seq analysis
BiocManager::install("edgeR") # Differential expression
BiocManager::install("limma") # Microarray analysis
BiocManager::install("GenomicRanges") # Genomic intervals
Popular Bioconductor Packages:
DESeq2 : RNA-seq differential expression analysis
edgeR : Alternative for RNA-seq and count data
limma : Linear models for microarray data
Biostrings : DNA/RNA sequence manipulation
GenomicRanges : Working with genomic coordinates
Variables and Assignment
A variable is a named storage location in memory that holds a value or object. Variables allow you to store, reference, and manipulate data throughout your R session.
Creating Variables
# Assignment operators
x <- 5
y = 10
20 -> z
# Print variables
x
Note : <- is preferred for assignment, = is used for function arguments.
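A small sketch of why the distinction matters inside function calls. The helper `double_it` and the name `value` are hypothetical, chosen only for this illustration, and the example assumes no variable called `value` exists yet:

```r
# Hypothetical helper used only for this illustration
double_it <- function(value) value * 2

# `=` inside a call only matches the argument named `value`; no variable is created
r1 <- double_it(value = 21)
created_by_equals <- exists("value", inherits = FALSE)

# `<-` inside a call also creates `value` in the calling environment as a side effect
r2 <- double_it(value <- 21)
created_by_arrow <- exists("value", inherits = FALSE)

r1
r2
created_by_equals
created_by_arrow
```

Both calls return the same result; only the second leaves a new variable behind.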
Variable Names
# Valid names
my_variable <- 1
myVariable <- 2
.variable <- 3
variable. <- 4
# Invalid names (would cause errors)
# 1variable <- 5
# my-variable <- 6
# my variable <- 7
Variable Naming Conventions
Rules for valid variable names in R:
Valid Names:
Start with a letter or dot (not followed by number)
Contain letters, numbers, dots, or underscores
Examples: my_var, myVariable, .hidden, var2
Invalid Names:
Start with number: 2variable ❌
Contain spaces: my variable ❌
Use hyphens: my-variable ❌
Start with dot + number: .2way ❌
Reserved Words (cannot use):
if, else, repeat, while, function
for, in, next, break
TRUE, FALSE, NULL, Inf, NaN
NA, NA_integer_, NA_real_, NA_complex_, NA_character_
Not reserved, but avoid reusing the names of built-in functions: c, q, t, C, D, I
Also avoid: T, F (shortcuts for TRUE/FALSE that, unlike TRUE and FALSE, can be overwritten)
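The naming rules above can be checked programmatically. A minimal sketch using base R's make.names(), which rewrites a string into a syntactically valid name; the helper `is_valid_name` is a hypothetical name introduced here, not part of R:

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
is_valid_name <- function(x) x == make.names(x)

candidates <- c("my_var", ".hidden", "var2",
                "2variable", "my variable", "my-variable", ".2way", "TRUE")

# Show each candidate, whether it is valid, and what R would turn it into
data.frame(
  name = candidates,
  valid = is_valid_name(candidates),
  repaired = make.names(candidates)
)
```

Reserved words such as TRUE are reported as invalid because make.names() has to alter them to make them usable.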
Variable Naming Style
Best practices for readable, maintainable code:
Use descriptive names: patient_ages instead of x
Avoid cryptic abbreviations: genotypes instead of fsjht45jkhsdf4
Choose a Consistent Convention:
snake_case : my_variable, patient_data
camelCase : myVariable, patientData
dot.case : my.variable, patient.data
Pick one and stick with it throughout your code
Keep Names Reasonable Length:
Short enough: tmp for temporary variables
Long enough: gene_expression_matrix not g_e_m
Avoid: my.variable.2 → use my.variable2
Common Conventions:
i, j, k: Loop counters
tmp: Temporary variables
df: Data frames
n: Counts or lengths
idx: Indices
Data Types and Classes
Basic Data Types
# Numeric (double)
num <- 3.14
class(num)
# Integer
int <- 42L # The L suffix forces the number to be stored as integer
class(int)
# Character (string)
text <- "Hello World"
class(text)
# Logical (boolean)
logic <- TRUE
class(logic)
Type Checking
# Check data types
is.numeric(num)
# Check if it's a specific type
is.double(num)
Type Casting and Conversion
Converting Between Types
# Numeric to character
as.character(3.14)
# Character to numeric
as.numeric("42.5")
# Numeric to integer
as.integer(3.9) # Returns 3 (truncates, does not round)
# Logical to numeric
as.numeric(TRUE) # Returns 1
as.numeric(FALSE) # Returns 0
Common Conversions
# String to logical
as.logical("TRUE")
# Factor to character
factor_var <- factor(c("A", "B", "A"))
as.character(factor_var)
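Conversions can fail or behave unexpectedly. The sketch below illustrates three common pitfalls; the variable names (`bad`, `dose`) are made up for this example:

```r
# Text that is not a number becomes NA (R also emits a warning, silenced here)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)

# as.integer() truncates toward zero; call round() first if rounding is wanted
as.integer(3.9)
as.integer(round(3.9))

# Factors store integer level codes, so convert via character to recover the values
dose <- factor(c("10", "20", "10"))
as.numeric(dose)                 # level codes
as.numeric(as.character(dose))   # original values
```

The factor case is the one most likely to go unnoticed, because the wrong result is still a plausible-looking number.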
Basic String Operations
String Basics
# Create strings
name <- "Alice"
greeting <- 'Hello'
# String length
nchar(name)
# Combine strings
paste("Hello", "World")
paste("Hello", "World", sep = " ")
String Manipulation
# Substring
substr("Hello World", 1, 5)
# Upper/lower case
toupper("hello")
tolower("HELLO")
# Split strings
strsplit("Hello World", " ")
[[1]]
[1] "Hello" "World"
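Two details of the string functions above are easy to miss: paste() inserts a space by default, and strsplit() returns a list rather than a plain vector. A short sketch, with illustrative variable names:

```r
first <- "Hello"
second <- "World"

with_space <- paste(first, second)           # separator defaults to one space
no_space <- paste0(first, second)            # paste0() uses no separator
with_dash <- paste(first, second, sep = "-")

# strsplit() returns a list, so take element [[1]] to get the character vector
parts <- strsplit("Hello World", " ")[[1]]
second_word <- parts[2]

with_space
no_space
with_dash
second_word
```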
Complex Data Structures
R provides powerful data structures for organizing and manipulating data efficiently.
Using basic data types (numeric, logical, character), we can construct more complex data structures:
Data Structures in R
Overview of R Data Structures:
Vectors : One-dimensional arrays of the same data type
Matrices : Two-dimensional arrays of the same data type
Lists : Ordered collections that can contain different data types
Data Frames : Two-dimensional tables (like spreadsheets) with mixed data types
These structures form the foundation for data analysis in bioinformatics!
Vectors
Vectors (or Atomic Vectors) are one-dimensional arrays of the same data type.
Creating Vectors
# Numeric vector
numbers <- c(1, 2, 3, 4, 5)
numbers
# Character vector
fruits <- c("apple", "banana", "orange")
fruits
[1] "apple" "banana" "orange"
# Logical vector
booleans <- c(TRUE, FALSE, TRUE)
booleans
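Because a vector holds a single data type, mixing types in c() makes R silently convert everything to the most general one. A minimal sketch (variable names are illustrative):

```r
# Number + string + logical: everything becomes character
mixed <- c(1, "a", TRUE)
mixed
class(mixed)

# Number + logical: TRUE/FALSE become 1/0
num_logic <- c(1, TRUE, FALSE)
num_logic
class(num_logic)
```

This is worth checking with class() whenever a numeric column unexpectedly behaves like text.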
Vector Operations
# Vector arithmetic
x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y
# Vector functions
length(numbers)
Vector Indexing
# Access elements
fruits[1] # First element
fruits[2:3] # Second and third
fruits[c(1, 3)] # First and third
# Negative indexing (exclude)
fruits[-1] # All except first
fruits[-c(1, 3)] # Exclude first and third
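Besides positions, a vector can also be indexed with a logical condition, which is a common way to filter values. A short sketch reusing the `numbers` vector from earlier (redefined here so the example runs on its own):

```r
numbers <- c(1, 2, 3, 4, 5)

# A comparison returns one TRUE/FALSE per element
is_large <- numbers > 2
is_large

# Using that logical vector as an index keeps only the TRUE positions
large <- numbers[is_large]
even <- numbers[numbers %% 2 == 0]

# which() gives the positions instead of the values
positions <- which(numbers > 2)

large
even
positions
```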
Named Vectors
# Create named vector
ages <- c(Alice = 25, Bob = 30, Carol = 35)
ages
# Access by name
ages["Alice"]
ages[c("Alice", "Carol")]
Matrices
Matrices are two-dimensional arrays of the same data type.
Creating Matrices
# Create matrix from vector
matrix(1:9, nrow = 3, ncol = 3)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
# By row or column
matrix(1:6, nrow = 2, byrow = TRUE)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
matrix(1:6, nrow = 2, byrow = FALSE)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Matrix Operations
# Create sample matrix
m <- matrix(1:4, nrow = 2)
m
[,1] [,2]
[1,] 1 3
[2,] 2 4
# Matrix dimensions
dim(m)
Matrix Indexing
# Access elements
m[1, 2] # Row 1, Column 2
# Multiple elements
m[1:2, 1] # Rows 1-2, Column 1
Matrix Arithmetic
# Matrix operations
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)
m1 + m2 # Element-wise addition
[,1] [,2]
[1,] 6 10
[2,] 8 12
m1 * m2 # Element-wise multiplication
[,1] [,2]
[1,] 5 21
[2,] 12 32
m1 %*% m2 # Matrix multiplication
[,1] [,2]
[1,] 23 31
[2,] 34 46
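The difference between `*` and `%*%` can be verified by hand. As a sketch, the code below recomputes one entry of the matrix product as "row of m1 times column of m2"; the names `elementwise`, `product`, and `manual` are introduced only for this example:

```r
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)

elementwise <- m1 * m2    # multiplies matching entries
product <- m1 %*% m2      # true matrix multiplication

# Entry [1, 1] of the matrix product: row 1 of m1 times column 1 of m2, summed
manual <- sum(m1[1, ] * m2[, 1])

elementwise[1, 1]
product[1, 1]
manual
```

Here `manual` matches `product[1, 1]` but not `elementwise[1, 1]`, which is the quickest way to tell the two operators apart.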
Lists
Lists are ordered collections that can contain different data types.
Creating Lists
# Simple list
my_list <- list(1, "hello", TRUE)
my_list
[[1]]
[1] 1
[[2]]
[1] "hello"
[[3]]
[1] TRUE
# Named list
person <- list(
  name = "Alice",
  age = 25,
  scores = c(85, 90, 88)
)
person
$name
[1] "Alice"
$age
[1] 25
$scores
[1] 85 90 88
Accessing List Elements
# By index
person[[1]] # First element
person[[3]] # Third element
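List elements can also be reached by name, and single versus double brackets behave differently. A minimal sketch using the same `person` list (redefined here so it runs on its own):

```r
person <- list(name = "Alice", age = 25, scores = c(85, 90, 88))

# $ and [[ ]] both return the element itself
person$name
person[["age"]]

# Single brackets return a smaller list, not the element
sub <- person["age"]
class(sub)

# Elements can be used directly once extracted
second_score <- person$scores[2]
mean_score <- mean(person$scores)
second_score
mean_score
```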
List Operations
# List length
length(person)
# Names of elements
names(person)
[1] "name" "age" "scores"
# Add elements
person$city <- "Stockholm"
person
$name
[1] "Alice"
$age
[1] 25
$scores
[1] 85 90 88
$city
[1] "Stockholm"
# Remove elements
person$city <- NULL
person
$name
[1] "Alice"
$age
[1] 25
$scores
[1] 85 90 88
Data Frames
Data Frames are two-dimensional tables (like spreadsheets) with mixed data types
Creating Data Frames
# Create data frame
students <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(20, 21, 19),
  grade = c("A", "B", "A")
)
students
name age grade
1 Alice 20 A
2 Bob 21 B
3 Carol 19 A
Data Frame Properties
# Structure
str(students)
'data.frame': 3 obs. of 3 variables:
$ name : chr "Alice" "Bob" "Carol"
$ age : num 20 21 19
$ grade: chr "A" "B" "A"
# Dimensions
dim(students)
# Column names
names(students)
Data Frame Indexing
# Access columns
students
name age grade
1 Alice 20 A
2 Bob 21 B
3 Carol 19 A
students$name # Column by name
[1] "Alice" "Bob" "Carol"
students[, 2] # Second column
# Access rows
students[1, ] # First row
name age grade
1 Alice 20 A
students[1:2, ] # First two rows
name age grade
1 Alice 20 A
2 Bob 21 B
name age grade
1 Alice 20 A
2 Bob 21 B
3 Carol 19 A
# Access specific elements
students[1, 2] # Row 1, Column 2
students[1, "age"] # Row 1, age column
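Row indexing also accepts logical conditions, which is how rows are typically filtered in base R. A short sketch on the same `students` table (redefined here so the example is self-contained; `older`, `a_names`, and `by_age` are illustrative names):

```r
students <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(20, 21, 19),
  grade = c("A", "B", "A")
)

# Keep rows where the condition is TRUE; the empty column slot keeps all columns
older <- students[students$age >= 20, ]

# Combine a row condition with a column name
a_names <- students[students$grade == "A", "name"]

# Sort rows by age using order()
by_age <- students[order(students$age), ]

older
a_names
by_age
```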
Data Frame Operations
# Add column
students$passed <- c(TRUE, TRUE, FALSE)
students
name age grade passed
1 Alice 20 A TRUE
2 Bob 21 B TRUE
3 Carol 19 A FALSE
# Summary statistics
summary(students)
name age grade passed
Length:3 Min. :19.0 Length:3 Mode :logical
Class :character 1st Qu.:19.5 Class :character FALSE:1
Mode :character Median :20.0 Mode :character TRUE :2
Mean :20.0
3rd Qu.:20.5
Max. :21.0
# View first/last rows
head(students)
name age grade passed
1 Alice 20 A TRUE
2 Bob 21 B TRUE
3 Carol 19 A FALSE
tail(students)
name age grade passed
1 Alice 20 A TRUE
2 Bob 21 B TRUE
3 Carol 19 A FALSE
Reading Data
Reading CSV Files
# Read CSV file
data <- read.csv("assets/data.csv")
head(data)
Gene_Symbol Gene_ID Control_1 Control_2 Treatment_1 Treatment_2
1 GAPDH ENSG00000111640 1250 1180 980 1050
2 ACTB ENSG00000075624 890 920 750 820
3 TP53 ENSG00000141510 450 480 680 720
4 MYC ENSG00000136997 320 350 580 620
5 EGFR ENSG00000146648 180 200 450 480
6 BRCA1 ENSG00000012048 150 170 380 410
Factors in R are a special data type used to represent categorical data with predefined levels. They can be ordered or unordered. stringsAsFactors = FALSE prevents automatic conversion of strings to factors; since R 4.0.0 this is already the default, so the argument mainly matters for older R versions or when you want to state the behavior explicitly.
# With custom options
data <- read.csv("assets/data.csv",
                 header = TRUE,
                 sep = ",",
                 stringsAsFactors = FALSE)
head(data)
Gene_Symbol Gene_ID Control_1 Control_2 Treatment_1 Treatment_2
1 GAPDH ENSG00000111640 1250 1180 980 1050
2 ACTB ENSG00000075624 890 920 750 820
3 TP53 ENSG00000141510 450 480 680 720
4 MYC ENSG00000136997 320 350 580 620
5 EGFR ENSG00000146648 180 200 450 480
6 BRCA1 ENSG00000012048 150 170 380 410
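To make the idea of factors more concrete, here is a minimal sketch that builds them by hand. The `condition` and `severity` vectors are invented for illustration and are not part of the course data:

```r
# Unordered factor: categories with a fixed set of levels
condition <- c("Control", "Control", "Treatment", "Treatment")
condition_factor <- factor(condition, levels = c("Control", "Treatment"))
levels(condition_factor)
table(condition_factor)          # counts per level
as.integer(condition_factor)     # underlying integer codes

# Ordered factor: levels have a meaningful order, so comparisons work
severity <- factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)
low_before_high <- severity[1] < severity[2]
low_before_high
```

Setting the levels explicitly, rather than relying on alphabetical order, keeps the reference group under your control.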
Reading Excel Files
# Install readxl if needed
# install.packages("readxl")
library(readxl)
# Read Excel file
excel_data <- read_excel("assets/data.xlsx")
head(excel_data)
# A tibble: 6 × 6
Gene_Symbol Gene_ID Control_1 Control_2 Treatment_1 Treatment_2
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 GAPDH ENSG00000111640 1250 1180 980 1050
2 ACTB ENSG00000075624 890 920 750 820
3 TP53 ENSG00000141510 450 480 680 720
4 MYC ENSG00000136997 320 350 580 620
5 EGFR ENSG00000146648 180 200 450 480
6 BRCA1 ENSG00000012048 150 170 380 410
# Read specific sheet
sheet_data <- read_excel("assets/data.xlsx", sheet = "Sheet 1")
head(sheet_data)
# A tibble: 6 × 6
Gene_Symbol Gene_ID Control_1 Control_2 Treatment_1 Treatment_2
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 GAPDH ENSG00000111640 1250 1180 980 1050
2 ACTB ENSG00000075624 890 920 750 820
3 TP53 ENSG00000141510 450 480 680 720
4 MYC ENSG00000136997 320 350 580 620
5 EGFR ENSG00000146648 180 200 450 480
6 BRCA1 ENSG00000012048 150 170 380 410
# Read tab-separated file
tsv_data <- read.delim("assets/data.tsv")
head(tsv_data)
Gene_Symbol Gene_ID Control_1 Control_2 Treatment_1 Treatment_2
1 GAPDH ENSG00000111640 1250 1180 980 1050
2 ACTB ENSG00000075624 890 920 750 820
3 TP53 ENSG00000141510 450 480 680 720
4 MYC ENSG00000136997 320 350 580 620
5 EGFR ENSG00000146648 180 200 450 480
6 BRCA1 ENSG00000012048 150 170 380 410
# Read from URL
# url_data <- read.csv("https://example.com/data.csv")
# Write data (row names are written as an extra first column by default)
write.csv(students, "students.csv")
# Write data without the row-name column
write.csv(students, "students.csv", row.names = FALSE)
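The effect of row.names = FALSE is easiest to see by writing a file and reading it back. A sketch under the assumption that temporary files are acceptable for the demo (tempfile() is used so nothing is left in the working directory; the small `students` table is redefined here):

```r
students <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(20, 21, 19)
)

# Temporary file paths, so the demo leaves no files behind
with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(students, with_rn)                        # default: row names included
write.csv(students, without_rn, row.names = FALSE)  # row names dropped

cols_with <- names(read.csv(with_rn))
cols_without <- names(read.csv(without_rn))
students_back <- read.csv(without_rn)

cols_with      # extra first column holding the row names
cols_without   # same columns as the original

unlink(c(with_rn, without_rn))
```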
Functions
Functions are reusable blocks of code that perform specific tasks. They help organize code, reduce repetition, and make programs more modular and maintainable.
R Functions Overview
Why Use Functions?
Reusability : Write once, use many times
Organization : Break complex tasks into smaller, manageable pieces
Maintainability : Easier to debug and update code
Abstraction : Hide implementation details
Functions take inputs (arguments), perform operations, and return outputs (results).
Writing Functions
Basic Function Structure
# Simple function
greet <- function(name) {
  greeting <- paste0("Hello, ", name, "!") # paste0() avoids a stray space before "!"
  return(greeting)
}
# Call the function
greet("Alice")
Functions with Multiple Parameters
# Function with multiple parameters
calculate_bmi <- function(weight_kg, height_m) {
  bmi <- weight_kg / (height_m^2)
  return(bmi)
}
# Usage
calculate_bmi(70, 1.75)
Functions with Default Values
# Function with defaults
power_function <- function(x, power = 2) {
  result <- x^power
  return(result)
}
# Usage
power_function(3) # Uses default power = 2
power_function(3, 3) # Uses power = 3
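Arguments can also be passed by name, in which case their order no longer matters, and because `^` is vectorized the same function works on whole vectors. A brief sketch (the function is repeated so the example runs on its own):

```r
power_function <- function(x, power = 2) {
  result <- x^power
  return(result)
}

# Named arguments may be given in any order
by_name <- power_function(power = 3, x = 2)

# The default still applies when power is omitted, even for vector input
squares <- power_function(c(1, 2, 3))

by_name
squares
```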
Functions with Conditional Logic
# Function with if-else
grade_letter <- function(score) {
  if (score >= 90) {
    return("A")
  } else if (score >= 80) {
    return("B")
  } else if (score >= 70) {
    return("C")
  } else {
    return("F")
  }
}
# Usage
grade_letter(85)
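Note that if() expects a single TRUE or FALSE, so grade_letter() handles one score at a time; in recent R versions (4.2.0 and later) passing a longer vector to if() is an error. One possible way to grade several scores, sketched here with sapply() and an invented `scores` vector:

```r
grade_letter <- function(score) {
  if (score >= 90) {
    return("A")
  } else if (score >= 80) {
    return("B")
  } else if (score >= 70) {
    return("C")
  } else {
    return("F")
  }
}

# Example scores, made up for this illustration
scores <- c(95, 85, 72, 40)

# sapply() calls grade_letter() once per score and collects the results
letters_out <- sapply(scores, grade_letter)
letters_out
```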
Quiz Time! 🧠
Question 1
Which operator is used for assignment in R?
A) =
B) <-
C) Both A and B
Question 2
What function creates a vector in R?
A) vector()
B) c()
C) make_vector()
Question 3
How do you access the first element of a vector named my_vec?
A) my_vec[0]
B) my_vec[1]
C) my_vec.first
(Answers: 1-C, 2-B, 3-B)
Next Steps
What to Learn Next
Data manipulation with dplyr
Data visualization with ggplot2
Statistical analysis functions
Writing R scripts and RMarkdown
Advanced data structures
Working with dates and times
Resources
Thank You!
You’ve completed the basic R programming tutorial!
Remember:
Use ?function_name for help
Install packages with install.packages()
Load packages with library()
Practice regularly with real data
Questions?
Feel free to ask your instructor or classmates!
🐧 Happy R programming! 🐧
Practice Time! 💻
Exercise 1: Variables and Types
Create variables for your name, age, and height
Check their data types with class()
Convert your age to character and height to integer
# Your code here
name <- "Your Name"
age <- 25
height <- 1.75
class(name)
Exercise 2: Vectors and Data Frames
Create a vector of your favorite numbers
Create a data frame with names and scores
Calculate the mean of the scores
# Your code here
numbers <- c(1, 5, 10, 15, 20)
scores_df <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  score = c(85, 92, 78)
)
mean(scores_df$score)